Bioinformatics Advances — Latest Matching Preprints

1

EcoXAI: Autonomous Agentic Ecosystem for Explainable Artificial Intelligence and Biomedical Discovery

Matsumoto, N.; Choi, H.; Freda, P. J.; Hernandez, M. E.; Wang, Z. P.; Moore, J. H.

2026-07-13 bioinformatics 10.64898/2026.07.08.737358 medRxiv

Top 0.1%

19.0%

Motivation: As biomedical datasets and knowledge graphs continue to grow in size, complexity, and heterogeneity, navigating and extracting actionable insights from them presents a major bottleneck for researchers. There is a clear need for autonomous analytical solutions that can utilize recent advancements in agentic AI such as agent harnessing and loop engineering without introducing hallucination or workflow fragmentation. Researchers, regardless of technical expertise, need tools that streamline complex data analysis and deliver meaningful, actionable insights grounded in both data and established biomedical knowledge. EcoXAI addresses this by introducing a modular, customizable, containerized multi-agent system that structures analysis into explicit pipeline execution stages, lowering the computational barrier for clinical and translational researchers. Results: EcoXAI replaces monolithic AI text interfaces with an autonomous execution-driven framework with specialized bioinformatics agents for delivering proactive, data-driven insights grounded in established biological knowledge. Unlike purely LLM-driven or less integrated AI solutions prone to hallucinations or biologically implausible outcomes, EcoXAI's multi-agent framework, which leverages modern agentic management and explicit knowledge graph integration, provides greater transparency and verifiability in its reasoning. In our use case in drug repurposing for Alzheimer's disease, EcoXAI evaluated 103 drug candidates and identified 79 novel candidates whose predictive models exceeded a randomized baseline, including the CCR5 antagonist Maraviroc, whose generated hypothesis was subsequently supported by the literature. These results demonstrate the potential of knowledge graph-grounded AI agents to accelerate hypothesis-driven biomedical research. Availability and implementation: EcoXAI is available on GitHub at: https://github.com/EpistasisLab/EcoXAI. Contact: jason.moore@csmc.edu

2

Quantifying Evidence for Competing Biomedical Hypotheses using Large Language Models and Bayesian Analysis

Moore, B. M.; Freeman, J.; Millikin, R. J.; Mohanty, C.; George, K. S.; Bal, A.; Lock, C.; Sauer, J.-D.; Spurgeon, M. E.; Moore, D. L.; Travers, B. G.; Stewart, R.

2026-06-07 bioinformatics 10.64898/2026.06.05.730173 medRxiv

Top 0.1%

18.4%

Science fundamentally depends on the generation and testing of hypotheses, many of them controversial. An explosion in scientific literature has made evaluating hypotheses even within a domain a problem of scale, and risks slowing an already extensive consensus-building process. While this challenge has prompted interest in automated hypothesis evaluation tools, existing methods have not yet proven effective for comparing hypotheses. Here, we introduce KM-GPT-DCH, a transparent and reproducible literature-based algorithm that combines co-occurrence methods with large language models (LLMs) to compare controversial hypotheses, using a structured scoring approach with Bayesian methods to estimate confidence. When testing the algorithm on historical controversial hypotheses previously decided, KM-GPT-DCH chooses the correct hypothesis with high confidence several years before the scientific community or public did so. We further apply the algorithm to compare twenty unresolved controversial hypothesis pairs, providing guidance for future research. The method can help researchers and the public to evaluate biomedical hypotheses such as "Is it more likely that monoamine deficiency or inflammation causes depression?" It can also be used to assess and visualize historical trends in the scientific literature. A web-based implementation of the algorithm is freely available at https://skim.morgridge.org.

3

SwiftNJ: Fast Exact Neighbour Joining via Correctness-Gated Coding Agents

Christensen, J.

2026-05-29 bioinformatics 10.64898/2026.05.28.728410 medRxiv

Top 0.1%

15.1%

The capability profile of frontier coding agents in 2026 varies sharply across technical domains, motivating domain-specific empirical study of where, and under what oversight conditions, such systems can contribute to specialised technical work. This paper presents one such study in computational phylogenetics. Neighbour joining (NJ) is a widely used distance-based method for inferring evolutionary trees in microbial epidemiology, comparative genomics, and large-scale sequence clustering. Its constant-factor runtime is set by hand-tuned native implementations; RapidNJ is a widely-cited representative of that class and serves here as the comparison baseline. We ask whether a current-generation coding agent, operating under a correctness-gated optimisation harness with deterministic correctness gates calibrated against a QuickTree reference, can advance that constant factor on a fixed benchmark. The resulting implementation, SwiftNJ, achieves a geometric-mean runtime ratio of 0.565 against a locally-rebuilt RapidNJ-native binary across a 59-matrix corpus, sub-parity on 58 of 59 matrices. On 400 shuffled inputs drawn from 16 small matrices (n ≤ 2000), SwiftNJ matched the QuickTree reference at Robinson-Foulds distance zero. In this domain, a correctness-gated coding agent meaningfully improved on a strong native baseline, suggesting that harness-guided optimisation holds promise for performance-critical bioinformatics tools; further work is needed to establish how broadly the approach generalises.
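
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a geometric-mean runtime ratio and a sub-parity count can be computed from paired timings. The function names and the timing values are hypothetical and are not taken from the SwiftNJ benchmark; the preprint's own measurement protocol may differ.

```python
import math


def geometric_mean_ratio(candidate_times, baseline_times):
    """Geometric mean of per-input runtime ratios (candidate / baseline)."""
    if len(candidate_times) != len(baseline_times) or not candidate_times:
        raise ValueError("need two equal-length, non-empty lists of timings")
    log_sum = 0.0
    for cand, base in zip(candidate_times, baseline_times):
        if cand <= 0 or base <= 0:
            raise ValueError("runtimes must be positive")
        # Averaging logs keeps one very slow or very fast input from dominating.
        log_sum += math.log(cand / base)
    return math.exp(log_sum / len(candidate_times))


def count_sub_parity(candidate_times, baseline_times):
    """Number of inputs on which the candidate is strictly faster."""
    return sum(1 for c, b in zip(candidate_times, baseline_times) if c < b)


# Hypothetical timings in seconds for three distance matrices.
candidate = [0.5, 2.0, 8.0]
baseline = [1.0, 4.0, 8.0]

ratio = geometric_mean_ratio(candidate, baseline)
faster_on = count_sub_parity(candidate, baseline)
print(f"geometric-mean ratio: {ratio:.3f}, sub-parity on {faster_on} of {len(candidate)}")
```

A ratio below 1 means the candidate is faster on average in the multiplicative sense, which is why the abstract reports it alongside the per-matrix sub-parity count.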

4

Are Current AI Virtual Cell Models Useful for Scientific Discovery?

Bereket, M. D.; Leskovec, J.

2026-04-25 bioinformatics 10.64898/2026.04.23.719015 medRxiv

Top 0.1%

14.8%

AI models are increasingly developed to predict the effect of perturbations on gene expression, but current benchmarks fail to reliably measure model performance. Here, we argue that new benchmarks that directly measure the value of model predictions for specific scientific discovery outcomes are needed to address this gap. We present PerturbHD, an evaluation framework for AI-enabled hit discovery, to demonstrate the benefits of our proposed approach.

5

GeneFior: A back to basics and transparent multi-tool approach to sequence detection

Dimonaco, N. J.; Lawther, K.

2026-05-18 bioinformatics 10.64898/2026.05.15.724838 medRxiv

Top 0.1%

14.6%

The detection of sequences of interest, such as antimicrobial resistance genes, directly from genomic and metagenomic sequencing data has become routine, enabled by curated reference databases and rapid in silico sequence search tools. Yet most workflows depend on prior assembly, an inherently lossy process in which a substantial proportion of reads fail to assemble or are collapsed into consensus sequences, causing low-abundance variants and nucleotide-level diversity to be systematically obscured. The tools used to interrogate the resulting assemblies compound this further, clustering reference sequences at arbitrary identity thresholds, imposing hidden parameter defaults, and reducing intermediate alignment evidence to summarised outputs that cannot be critically evaluated or reproduced. Here we present GeneFior, a transparent, multi-tool workflow integrating BLAST, DIAMOND, Bowtie2, BWA, and Minimap2 to search both DNA and protein sequences against any user-supplied reference database. By enforcing gene-centric identity and coverage thresholds at both the read and gene level, GeneFior reduces false positives while retaining sensitivity to genuine, low-abundance variants, including those differing at single-nucleotide resolution. Crucially, by exposing all alignment parameters, preserving intermediate outputs, and generating cross-tool consensus detection matrices, GeneFior makes the influence of tool choice, database selection, and parameter configuration on reported gene profiles directly observable and reproducible.

6

F.A.D.E. (Fully Agentic Drug Engine): A Conversational AI Platform for Drug Discovery

Kantorow, J.; Mani, N.; Mohanraj, N. R.; Zong, X.

2026-06-25 biophysics 10.64898/2026.06.20.733481 medRxiv

Top 0.1%

12.7%

Drug discovery remains one of the costliest and most time-intensive endeavors in the pharmaceutical pipeline, with average development costs exceeding $2.3 billion per drug, timelines spanning more than a decade, and attrition rates above 90% in clinical trials. While computational methods have expanded the searchable chemical space, current pipelines remain fragmented and largely inaccessible to researchers without deep interdisciplinary expertise. Here we present F.A.D.E. (Fully Agentic Drug Engine), a multi-agent, open-source platform that converts natural language queries into potential drug candidates, substantially lowering the expertise barrier to advanced computational drug discovery. F.A.D.E. employs a three-branch hierarchical architecture that adapts to the level of available structural data for any protein target, integrating structure prediction, binding pocket detection, equivariant diffusion-based de novo ligand generation, and binding affinity estimation into a single automated pipeline. We validate F.A.D.E. on two structurally distinct targets: the epidermal growth factor receptor kinase domain (EGFR), a well-established oncology target, and cellular retinol-binding protein 1 (CRBP1), a lipid-binding protein involved in retinoid metabolism. For EGFR, our generated candidates achieved QED scores of 0.85 compared to 0.46 for the co-crystallised reference ligand, demonstrating marked improvement in predicted drug-likeness. Results across both targets confirm that F.A.D.E. can reliably generate chemically tractable, drug-like hit compounds across diverse protein classes from simple natural language input.

7

pylimma: a faithful, AnnData-native Python port of R limma for differential expression analysis

Mulvey, J.

2026-07-10 bioinformatics 10.64898/2026.07.06.736732 medRxiv

Top 0.1%

12.6%

pylimma is a faithful Python port of limma, intended to bring one of the most widely used tools for differential expression analysis to the developing Python ecosystem for transcriptomics and proteomics. We validated pylimma against the existing R implementation through 227 function-level comparisons and across six real-world datasets spanning microarray, RNA-seq, proteomics and single-cell transcriptomics. pylimma reproduces limma's numerical output to a median agreement of 13 significant figures and calls identical sets of differentially expressed features and gene sets. This supports its use as a drop-in replacement for the R implementation.

8

An Isoform-Centric, Structure-Aware Framework for Protein Function Prediction and Evaluation, Instantiated in 3DisoDeepPF

Jiang, F.; Zhao, R.; Liang, F.; Zhang, Y.; Cui, T.; Zhao, X.; Wang, X.; Xu, M.; Shuai, Y.; Luo, T.; Yao, H.; Xu, C.; Wang, Z.; Zeng, W.; Jiang, X.; Tang, Z.; Zhang, W.; Heng, P. A.; Li, Y.; Wang, X.

2026-04-28 cancer biology 10.64898/2026.04.24.720502 medRxiv

Top 0.1%

12.0%

Understanding functional diversity across protein isoforms remains a long-standing challenge with broad biological and translational implications, yet most computational methods are developed and benchmarked on a single reference protein per gene, limiting their ability to resolve isoform-specific functional differences. This challenge is compounded by the scarcity of isoform-resolved annotations and benchmarks. Here, we present an isoform-centric, structure-aware framework for protein family (Pfam) domain and Gene Ontology (GO) term prediction. We implemented this framework in 3DisoDeepPF, which pairs a dense graph built from sequence and structure similarity with multimodal representations, and evaluated 3DisoDeepPF in both conventional and isoform-resolved settings. Across conventional canonical benchmarks, 3DisoDeepPF showed strong performance relative to representative methods in both GO and Pfam prediction tasks. In an isoform-specific breast cancer atlas, 3DisoDeepPF remained stable under homology-controlled evaluation and detected Pfam changes among isoforms from the same gene. Additionally, 3DisoDeepPF provides evidence-tracing utilities that trace predicted labels to associated protein nodes, supporting traceability and biological plausibility assessment.

9

Development of the Mitochondrial Base Editor Analysis Package (MitoBEAP).

Mutti, C. D.; Nash, P.; Silva-Pinheiro, P.; Minczuk, M.; Van Haute, L.

2026-06-05 bioinformatics 10.64898/2026.06.02.729539 medRxiv

Top 0.1%

12.0%

For many years, the genetic manipulation of mitochondrial DNA was largely hampered by inefficient delivery of nucleic acids to mitochondria. However, the development of mitoCBEs, such as mitochondrial cytosine base editors (DdCBEs), which catalyse C•G-to-T•A conversions, and more recently, mitoABEs, such as transcription-activator-like effector (TALE)-linked deaminases (TALEDs) enabling A•T-to-G•C conversion, has transformed this field. Generally, mitochondrial base editors exhibit high on-target efficiency and are straightforward to design and use. Nonetheless, unintended off-target effects cannot be overlooked and should be assessed consistently with each experiment, which can be challenging without specialised bioinformatic expertise. Here, we introduce Mitochondrial Base Editor Analysis Package (MitoBEAP), which, to our knowledge, is the first R package specifically designed to analyse next-generation sequencing data from base-edited mtDNA samples. The package facilitates the analysis of potential off-target effects, offers multiple visualisation options, and allows customisation of graphics and thresholds for calculations. As a proof of concept, this study demonstrates how MitoBEAP can be utilised to measure the efficiency of DdCBE treatment targeting human 12S rRNA, as well as to identify potentially harmful off-target conversions across the mtDNA.

10

MechAInistic: An LLM-guided Multi-Agent System for Reasoning over Genome-Scale Constraint-Based Metabolic Models

Loecker, J.; Pujara, N.; Bryant, W.; Puniya, B. L.; Packrisamy, P.; Hamed, A.; Helikar, T.

2026-05-13 systems biology 10.64898/2026.05.11.723319 medRxiv

Top 0.1%

11.9%

Constraint-based metabolic modeling is a powerful way to study the mechanistic basis of cellular states and disease, but effective use demands substantial computational expertise and careful coordination of multi-step analyses. We developed MechAInistic to lower this barrier, enabling researchers to ask complex biological questions in natural language. MechAInistic is a multi-agent system harnessing large language models organized around an Architect-Reviewer pattern that converts a natural-language question into an executable, model-grounded workflow and produces a structured report. It supports pathway comparison, perturbation analysis, drug-target exploration, and literature interpretation across healthy and disease paired states. We evaluated MechAInistic's therapeutic hypothesis generation using two immune-cell use-cases. For rheumatoid arthritis/healthy Naive B models, it identified mitochondrial metabolic rewiring and nominated Devimistat/CPI-613 as an investigational OGDH-centered hypothesis. In CD4+ Th17 multiple sclerosis/healthy models, the workflow identified NADP-dependent isocitrate dehydrogenase as the optimal target and proposed Ivosidenib as an FDA-approved repurposing candidate.

11

Skill-Augmented Frontier Agents Nearly Saturate BixBench-Verified-50

Zhang, X.

2026-05-01 bioinformatics 10.64898/2026.04.28.721523 medRxiv

Top 0.1%

11.8%

Large language model (LLM) agents are increasingly used for biological data analysis, but prior benchmark results have given a mixed picture of whether they are ready for routine bioinformatics work. The original BixBench study reported only ~17-21% accuracy for frontier agents on open-answer bioinformatics questions [1]. Subsequent curation of BixBench-Verified-50 removed or revised ambiguous items, revealing much higher performance for modern agents [2]. Here we evaluate three frontier-model configurations on the 50 verified questions using the same local benchmark, prompt structure, answer format, and grading pipeline: GPT-5.4 with Claude Scientific Skills and no web access, Claude Opus 4.7 with Claude Scientific Skills and no web access, and GPT-5.5 with Claude Scientific Skills, bioSkills, and web access. The three configurations achieve 88.0% (44/50), 84.0% (42/50), and 98.0% (49/50) accuracy, respectively. The remaining GPT-5.5 error is not a clear analytical failure: the agent correctly computed Spearman correlations on the distributed CRISPRGeneEffect.csv values and selected CCND1, whereas the reference answer is recovered only after interpreting stronger essentiality as the opposite sign of the raw gene-effect score. Offline errors mainly occurred when agents lacked pathway, organism-annotation, BUSCO, or PhyKIT-related resources. These results show that frontier agents equipped with high-quality scientific skills can nearly saturate a curated bioinformatics benchmark, while also emphasizing that question wording, score sign conventions, and access to current external resources remain decisive for reliable evaluation.

12

amR: an R package suite to predict antimicrobial resistance in bacterial pathogens

Ghosh, A.; Brenner, E. P.; Boyer, E. A.; McKim, A. P.; Vang, C. K.; Wolfe, E. P.; Mayer, D. A.; Lesiyon, R. L.; Ravi, J.

2026-07-13 bioinformatics 10.64898/2026.07.10.734579 medRxiv

Top 0.2%

11.7%

Motivation: Identifying bacterial antimicrobial resistance (AMR) is critical for diagnostics and treatment, but resistance is a complex trait arising from myriad mechanisms spanning multiple molecular scales. Existing computational approaches often function as black boxes and rarely explore cross-species or multi-drug patterns. We developed amR, an integrated R package suite that provides a complete framework from bacterial genome data curation to interpretable AMR predictions, enabling identification of resistance mechanisms across species and drugs. Results: The amR R package suite contains three modular packages. amRdata downloads genomes and paired antimicrobial susceptibility testing data from BV-BRC and processes them, constructs pangenomes, and extracts features at gene/protein cluster, protein domain, annotated Clusters of Orthologous Groups and ResFinder AMR-associated features, and structural variant scales; data are stored in memory-efficient formats (Parquet, DuckDB). amRml trains interpretable machine learning models per species-drug combination, calculates feature importance and performance metrics, and provides rich ground for hypothesis generation and mechanism discovery. amRviz provides an interactive Shiny dashboard to explore metadata distributions and model performance across species and drugs, visualize top predictive AMR features, and analyze cross-model patterns across geographic/temporal strata. We apply the suite to Shigella sonnei, achieving a median Matthews Correlation Coefficient of 0.89 across 23 drugs and drug classes. With thousands of genomes, multi-scale features, and interpretable models, amR provides an accessible, comprehensive framework for AMR research. The amR package suite is installable via GitHub (https://github.com/JRaviLab/amR; BSD-3-Clause license).
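
As background for the reported metric, here is a minimal sketch of the Matthews Correlation Coefficient for a binary resistant/susceptible classifier, followed by a median across per-drug models. It is written in Python for illustration only (amR itself is an R suite), and the drug names and confusion counts are hypothetical, not results from the preprint.

```python
import math
import statistics


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion counts; returns 0.0 when undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # A common convention when any marginal total is zero.
        return 0.0
    return numerator / denominator


# Hypothetical per-drug confusion counts: (tp, tn, fp, fn).
per_drug_counts = {
    "drug_a": (45, 45, 5, 5),
    "drug_b": (50, 50, 0, 0),
    "drug_c": (25, 25, 25, 25),
}

per_drug_mcc = {
    drug: matthews_corrcoef(*counts) for drug, counts in per_drug_counts.items()
}
median_mcc = statistics.median(per_drug_mcc.values())
print(per_drug_mcc, median_mcc)
```

MCC ranges from -1 to 1 and stays informative when resistant and susceptible isolates are imbalanced, which is a common reason to prefer it over plain accuracy for AMR data.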

13

Deep-Interact Studio: An Interactive Deep Learning Model Building Platform for Biomolecular Interaction Prediction

Sarkar, D.; Bardhan, K.; Sarkar, C.

2026-07-07 bioinformatics 10.64898/2026.07.02.736034 medRxiv

Top 0.2%

11.5%

Motivation: Deep learning has rapidly become essential for predicting biomolecular interactions; however, most web-tools expose only a single, pre-built model with a fixed, non-configurable architecture that users cannot redesign, retrain on their own data, or compare; they are typically dedicated to one interaction type and often one species, and report prediction scores with little interpretability. These constraints force researchers to work across several disconnected, single-purpose tools and limit the flexibility, reproducibility, and long-term usability of existing platforms. Results: We present Deep-Interact Studio, a unified, web-based deep-learning platform that addresses these limitations by shifting interaction prediction from a model-centric to a user-driven, comparative, and interpretable paradigm. Within a single interface spanning all four interaction classes, namely protein-protein, drug-target, RNA-protein, and protein-DNA, users design their own model architectures layer by layer, configure training hyperparameters, and train them on their own data, including custom, species-specific datasets. Multiple user-built models can then be trained under identical conditions and compared side by side at both the training and inference levels, while integrated interpretability, including SHAP-based feature attribution, embedding-space visualization, and interaction hub analysis, turns predictions into auditable, mechanistically grounded results. Deep-Interact Studio is, to our knowledge, the only such platform to combine fine-grained per-layer model customization with multi-model comparison and interpretability, offering a flexible and transparent alternative to fixed, single-purpose tools.

14

fuzzyfold: a high-performance framework for stochastic RNA folding kinetics

Badelt, S.

2026-06-18 bioinformatics 10.64898/2026.06.17.732885 medRxiv

Top 0.2%

11.5%

The analysis of nucleic acid secondary structures is overwhelmingly dominated by methods that analyze the thermodynamic equilibrium distribution and ignore all dynamic aspects of nucleic acid folding. Yet, there are numerous popular examples of nucleic acid folding that rely on kinetic models, such as RNA riboswitches or DNA strand displacement systems. Here, I present fuzzyfold, a Rust-based software package for nucleic acid secondary structure analysis with an explicit focus on stochastic modeling. The framework introduces three-way and four-way shift moves with a biophysically motivated rate-model parameterization, and it is developed with an emphasis on both model flexibility and performance, e.g. allowing for the generation of single co-transcriptional trajectories for thousand-nucleotide long RNA molecules in just a few minutes. The main strength of the fuzzyfold package, however, is its focus on user and developer interfaces for long-term development. It provides easily installable command-line interfaces, e.g. for aggregating data from multiple parallel trajectories efficiently into an ensemble-level dynamic analysis. For developers, the codebase supports straightforward substitution of thermodynamic and kinetic free-energy models, and a flexible library interface with Python bindings, enabling integration of individual components into custom computational workflows.

15

geneSync: Gene Symbol Harmonization for Large-scale RNA-seq Data Integration

Feng, Z.; Li, T.

2026-05-07 bioinformatics 10.64898/2026.05.04.722831 medRxiv

Top 0.2%

11.3%

Cross-cohort integration of transcriptomic data is a routine strategy for boosting statistical power and enhancing generalizability. However, gene nomenclature inconsistencies across datasets (arising from annotation version updates, historical renaming, and synonym reassignment) introduce silent mismatches during feature alignment, causing genes to be falsely classified as absent or split into duplicate features. Here, we present geneSync, an R package that performs gene symbol harmonization as a quality-control (QC) step prior to data integration. geneSync uses a hierarchical matching strategy, prioritizing exact matches to authoritative gene symbols, then exact matches to National Center for Biotechnology Information (NCBI) gene symbols, and finally synonym-based fallback. It includes built-in offline databases for human, mouse, and rat, and supports auditable conflict resolution, cross-species ortholog mapping, and native integration with Seurat and SingleCellExperiment objects. Benchmarking across six mouse hippocampus scRNA-seq datasets spanning 2020-2025 and five CellRanger versions shows that 1.41%-6.22% of features require synonym resolution, and harmonization improves pairwise gene overlap by up to 13.14 percentage points, rescuing 707-1,098 genes per dataset pair. Notably, CellRanger annotation version, rather than data collection year, was identified as the primary driver of nomenclature discrepancy. geneSync is freely available at https://github.com/xiaoqqjun/geneSync.
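
The three-tier matching order described in the abstract can be sketched as follows. This is an illustrative Python sketch, not geneSync's R API: the function name, the lookup tables, the placeholder gene symbols, and the rule of flagging a multi-target synonym as ambiguous are all assumptions made for the example.

```python
def harmonize_symbol(symbol, authoritative, ncbi, synonyms):
    """Resolve one gene symbol, returning (resolved_symbol, match_tier).

    Tiers are tried in priority order: authoritative symbols first,
    then NCBI symbols, then a synonym table as a fallback.
    """
    if symbol in authoritative:
        return symbol, "authoritative"
    if symbol in ncbi:
        return symbol, "ncbi"
    candidates = synonyms.get(symbol, [])
    if len(candidates) == 1:
        return candidates[0], "synonym"
    if len(candidates) > 1:
        # Assumption: ambiguous synonyms are flagged for review, not guessed.
        return None, "ambiguous"
    return None, "unmatched"


# Hypothetical lookup tables with placeholder symbols.
authoritative = {"GeneA", "GeneB"}
ncbi = {"GeneC"}
synonyms = {"OldA": ["GeneA"], "Shared": ["GeneA", "GeneB"]}

for query in ["GeneA", "GeneC", "OldA", "Shared", "Missing"]:
    print(query, harmonize_symbol(query, authoritative, ncbi, synonyms))
```

Returning the tier alongside the resolved symbol keeps each decision traceable, in the spirit of the auditable conflict resolution the abstract mentions.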

16

EpiESM-GA: Resource-Efficient Protein Foundation Model Features for Equitable B-Cell Epitope Prediction

Gautam, P.; Mitra, P.

2026-06-26 bioinformatics 10.64898/2026.06.22.733745 medRxiv

Top 0.2%

10.7%

Prediction of B-cell epitopes can assist in reducing costly wet-lab screening in vaccine design, diagnostics, and antibody discovery. However, current predictors often suffer from noisy labels, weak generalization, and structure-dependent workflows. Here we present EpiESM-GA, an efficient sequence-only pipeline for linear B-cell epitope prediction. Positive and negative peptide examples are collected from IEDB, which provides experimentally tested epitopes and distinguishes positive and negative epitope records based on assay evidence (Vita et al., 2019). Each peptide is encoded with a frozen ESM-2 protein language model: a bidirectional transformer producing amino acid embeddings for downstream structure and function tasks (Lin et al., 2023). Mean-pooled embeddings are further compressed into a compact 420-feature representation with a genetic algorithm and classified with lightweight Random Forest, XGBoost, or MLP heads. This avoids foundation-model fine-tuning, reduces the number of trainable parameters, improves interpretability, and enables low-resource deployment. On an IEDB-derived benchmark, EpiESM-GA attains 0.880 ± 0.004 AUC-ROC, 0.852 ± 0.005 PR-AUC, 82.0 ± 0.6% accuracy, 0.79 ± 0.01 F1, and 0.74 ± 0.01 MCC, outperforming dense ESM-2 features and baselines LBCE-XGB, EpitopeVec, and BepiPred-2.0 (mean ± std over five independent random seeds). The framework shows how frozen protein foundation models can enable pandemic preparedness, peptide vaccine prioritization, diagnostic antigen screening, and equitable computational immunology.

17

Protein Function Prediction with Pretrained ProtT5 Embeddings and Gradient Boosting

Appel, J.; Butcher, N.

2026-04-28 bioinformatics 10.64898/2026.04.27.721184 medRxiv

Top 0.2%

10.6%

Protein function prediction remains a central challenge in computational biology due to the extreme sparsity and long-tail distribution of Gene Ontology (GO) [1] annotations. Advances in protein language models enable the extraction of dense, fixed-length representations from amino acid sequences, offering a scalable alternative to hand-picked features such as physicochemical properties. In this work, we evaluate a transformer-based embedding approach using ProtT5-XL combined with classical and modern multi-label classifiers for Gene Ontology prediction in the CAFA-6 setting. Fixed-length embeddings were generated via mean pooling of transformer hidden states and used as input to one-vs-rest logistic regression, gradient-boosted decision trees, and a neural network. Models were evaluated on held-out validation data with a focus on threshold selection, prediction sparsity, and behavior across frequent and rare GO terms. Gradient boosting consistently provided the best balance between predictive performance and stable prediction behavior, motivating its use for ontology-specific predictors across molecular function, biological process, and cellular component annotations. This study highlights practical modeling choices for large-scale protein function prediction using pretrained sequence embeddings and provides an interpretable baseline for future CAFA evaluations.
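
The mean-pooling step mentioned above can be illustrated with a small, dependency-free sketch. It assumes the per-residue hidden states are already available as a list of equal-length vectors; the function name and the toy numbers are hypothetical, and a real pipeline would take these vectors from the ProtT5 encoder rather than from literals.

```python
def mean_pool(hidden_states):
    """Average per-residue vectors into one fixed-length embedding."""
    if not hidden_states:
        raise ValueError("need at least one residue vector")
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    for vector in hidden_states:
        if len(vector) != dim:
            raise ValueError("all residue vectors must share one dimension")
        for i, value in enumerate(vector):
            totals[i] += value
    return [total / len(hidden_states) for total in totals]


# Toy hidden states: two "proteins" of different lengths, embedding size 2.
short_protein = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
long_protein = [[0.0, 1.0], [2.0, 1.0], [0.0, 1.0], [2.0, 1.0]]

short_embedding = mean_pool(short_protein)
long_embedding = mean_pool(long_protein)
print(short_embedding, long_embedding)
```

Both outputs have the same length regardless of sequence length, which is what lets a classifier such as logistic regression or gradient boosting consume proteins of any size.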

18

Predicting P-glycoprotein Substrate Status Using a Pretrained Graph Neural Network: A TDC Benchmark Study

Yan, J.; Duan, W.

2026-06-04 bioinformatics 10.64898/2026.06.01.729343 medRxiv

Top 0.2%

10.1%

P-glycoprotein (Pgp/ABCB1) is a critical efflux transporter that significantly impacts drug bioavailability and multidrug resistance. Accurate prediction of Pgp substrate status is essential for early-stage drug discovery. In this study, we evaluate a pretrained Graph Isomorphism Network (GIN) with attribute masking on the Pgp_Broccatelli benchmark from the Therapeutics Data Commons (TDC). Our approach fine-tunes a GIN encoder pretrained on approximately 2 million molecules using a self-supervised attribute masking strategy, followed by a multilayer perceptron (MLP) classification head. On the TDC benchmark, our model achieves an AUROC of 0.937 ± 0.004 across five independent runs, ranking second on the leaderboard as of May 2026. We further compare this approach against an XGBoost baseline using Morgan fingerprints (AUROC 0.912 ± 0.007), demonstrating the advantage of graph-based molecular representations with transfer learning for small-dataset ADMET prediction tasks.

19

Evaluating agentic AI for biological discovery in autonomous and copilot settings

Johri, S.; Pimenta, E. M.; Yates, J.; Fu, J.; Bao, E. L.; Jun, H.; Reardon, B.; Bacot, S.; Shady, M.; Fu, D.; Mei, W.; Camp, S. Y.; Park, J.; Van Allen, E.

2026-06-09 cancer biology 10.64898/2026.06.04.729919 medRxiv

Top 0.2%

9.9%

Advances in large language model (LLM)-based artificial intelligence (AI) agents have improved their ability to execute structured analytical workflows, including standard bioinformatic pipelines for biological discovery. However, computational biology rarely consists of deterministic pipeline execution alone. Biological datasets are heterogeneous and noisy, and meaningful discovery often requires open-ended hypothesis generation and iterative reasoning over multimodal evidence. These challenges are particularly evident in multi-omic studies, where paired molecular modalities and heterogeneous clinical contexts create both opportunities and obstacles for discovery. The extent to which emerging agentic AI systems can support or automate this mode of scientific discovery remains poorly understood. Here, we systematically evaluated the capabilities and limitations of agentic AI for biological discovery using multi-omic single-cell datasets spanning 11 cancer types. We developed the Multistep Multimodal Multiomic Agentic (M3A) Framework to support LLM-driven reasoning over persistent multimodal data states and to capture agentic reasoning behavior in autonomous and human-AI copilot settings. Using this framework, we assessed AI agents across complementary tasks, including autonomous cell-type annotation, generation of falsifiable biological hypotheses from gene programs, and copilot experiments testing the effect of human involvement and domain expertise. We found that current AI agents are effective at broad, systemic exploration of complex data, whereas domain experts remain critical for methodological guidance and biological synthesis across analyses. Together, our results delineate the current potential and boundaries of agentic AI in computational biology, and establish a framework for evaluating AI systems designed to support biological discovery.

20

An Open-Source Reproducible Workflow for Pocket-Oriented Virtual Screening and ADME-Integrated Chemoinformatics: A Multi-Target Flavivirus Case Study

Teixeira, J. P.; Bajay, M. M.; Freire, C. C. d. M.; Bettin, L. B. F.; Soares, A. P.; de Lima Neto, D. F.

2026-04-29 bioinformatics 10.64898/2026.04.28.721199 medRxiv

Top 0.2%

9.8%

Zika virus (ZIKV), yellow fever virus (YFV), West Nile virus (WNV), Usutu virus (USUV), and Saint Louis encephalitis virus (SLEV) remain major public health concerns, yet broad-spectrum antiviral options are limited. Here, we present an open-source, reproducible software workflow for pocket-oriented virtual screening and ADME-integrated chemoinformatics, designed to support standardized multi-target compound prioritization. As a case study, the workflow was applied to structural and nonstructural proteins from clinically relevant flaviviruses. Automated pocket detection using Concavity reduces site-selection bias by generating docking boxes from surface concavity clusters, while standardized downstream scripts parse docking logs, convert docking-derived binding energies into Kd-related metrics, integrate SwissADME descriptors, and compute LE, LLE, FQ, and drug-likeness rules. The framework also supports retrospective validation and comparative benchmarking using literature-supported reference compounds and target-specific plausibility checks. Rather than proposing experimentally validated antiviral candidates, this study provides a reusable computational framework for hypothesis generation, benchmarking, and downstream experimental prioritization in structure-based drug discovery. The workflow is modular and adaptable to other multi-target screening campaigns where integrated ranking across binding, physicochemical, and ADME dimensions is required. SUMMARY: We describe an open-source, reproducible software workflow that integrates pocket-oriented docking, ligand efficiency scoring, ADME descriptor integration, and multivariate chemoinformatics to standardize compound prioritization across multiple protein targets. The workflow combines open-source tools with auditable Bash, R, and Python scripts and is demonstrated through a multi-target flavivirus case study. Rather than claiming experimentally validated antiviral activity, the framework is intended to support hypothesis generation, retrospective benchmarking, transparent reporting, and downstream experimental prioritization.
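
To make the energy-to-metric conversions concrete, here is a minimal sketch using textbook definitions: Kd from the relation ΔG = RT ln Kd, ligand efficiency (LE) as -ΔG per heavy atom, and lipophilic ligand efficiency (LLE) as pKd minus logP. Whether the workflow uses exactly these formulas, this temperature, or pKd rather than another potency measure is an assumption; the function names and input values are hypothetical, and the fit quality (FQ) score is omitted.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal/(mol*K)


def binding_energy_to_kd(delta_g_kcal, temperature_k=298.15):
    """Estimated dissociation constant (mol/L) from delta G = RT ln Kd."""
    return math.exp(delta_g_kcal / (GAS_CONSTANT_KCAL * temperature_k))


def ligand_efficiency(delta_g_kcal, heavy_atoms):
    """Binding energy per non-hydrogen atom, in kcal/mol per atom."""
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return -delta_g_kcal / heavy_atoms


def lipophilic_ligand_efficiency(kd_molar, log_p):
    """pKd minus logP; higher values mean potency not bought with lipophilicity."""
    return -math.log10(kd_molar) - log_p


# Hypothetical docking result for one ligand.
delta_g = -8.0    # kcal/mol, docking-derived binding energy
heavy_atoms = 20  # non-hydrogen atom count
log_p = 2.5       # e.g. a consensus logP from an ADME descriptor table

kd = binding_energy_to_kd(delta_g)
le = ligand_efficiency(delta_g, heavy_atoms)
lle = lipophilic_ligand_efficiency(kd, log_p)
print(f"Kd ~ {kd:.2e} M, LE = {le:.2f}, LLE = {lle:.2f}")
```

Docking scores are only rough estimates of binding free energy, so values derived this way are best read as ranking aids rather than measured affinities, consistent with the hypothesis-generation framing of the abstract.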